Journal of the American Medical Informatics Association — Latest Matching Preprints

1

Care Plan Generation for Underserved Patients Using Multi-Agent Language Models: Applying Nash Game Theory to Optimize Multiple Objectives

Basu, S.; Baum, A.

2026-02-25 health informatics 10.64898/2026.02.23.26346934 medRxiv

Top 0.1%

67.4%

Background: Clinicians in care management programs are often in short supply relative to patient demand, especially in US Medicaid programs, and must simultaneously address clinical risk, time efficiency, and patients' social needs. Many studies have shown that large language models may assist with tasks such as summarizing patient care and generating care plans; yet these studies also show that different objectives given to agents often conflict and produce problems for safety, efficiency, and equity. We tested whether and to what degree a game-theoretic approach (a Nash bargaining framework) can produce care plans that advance multiple objectives across multiple language models, applying data from a real-world Medicaid cohort. Methods: We conducted two studies in a cohort of 5,148 activated Medicaid care management patients (69.9% female; 45.7% Black or African American; mean age 40.9 years) enrolled in Virginia and Washington. A retrospective evaluation applied five deterministic strategies to the full cohort to characterize multi-objective trade-offs. A pre-registered controlled paired experiment (N = 200) assigned each patient one Nash-orchestrated multi-agent plan and one compute-matched sequential self-critique plan, generated by locally hosted open-source models (DeepSeek-R1 8B; Llama 3.1 8B) with no patient data leaving local infrastructure. Pre-specified outcomes were Safety, Efficiency, Equity, and Composite (mean of the three), each scored 0-1. Reporting follows CONSORT 2010 and STROBE. Results: Nash orchestration produced a Composite score of 0.755 (95% CI 0.751-0.760) versus 0.742 (95% CI 0.739-0.746) for the compute-matched baseline; the paired difference was 0.013 (95% CI 0.008-0.019; p = 6.20 x 10-). Safety and Efficiency paired differences were small-to-moderate in effect size (Cohen's d = 0.327 and 0.543, respectively) with confidence intervals excluding zero. The Equity paired difference was 0.000 (95% CI -0.015 to 0.014; p = 0.987). Conclusions: Role-specialized Nash-orchestrated multi-agent language models produced measurably better Safety and Efficiency care plan quality than a compute-matched baseline under data-residency constraints. The null Equity result demonstrates that multi-objective role specialization does not automatically address equity; equity requires explicit design attention beyond composite weighting, with direct implications for responsible AI deployment in Medicaid care management. Author Summary: Care management programs for Medicaid patients need to address multiple goals at once: covering clinical risks, prioritizing the most impactful interventions, and recognizing the social barriers that affect whether patients can follow through on care plans. Prior research shows that automation tools powered by a single AI model tend to optimize for one of these goals at a time, sacrificing the others. We tested whether organizing several specialized AI agents, each focused on a different goal, and then combining their recommendations through a mathematical framework called Nash bargaining could produce better overall care plans for a real Medicaid population. We found that this multi-agent approach produced care plans that the AI judge rated as meaningfully safer and more efficient than plans generated by a single AI model using the same total amount of computation. However, the multi-agent approach did not produce plans that were more equitable in addressing patients' social needs, suggesting that equity requires more direct attention as a design target rather than emerging from multi-objective combination alone. All AI inference was performed on locally hosted computers, with no patient information sent to outside services, reflecting the privacy requirements of real-world Medicaid care management programs.
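The abstract above does not spell out how the Nash bargaining step is formulated. As a rough illustration only, the textbook Nash bargaining solution picks the option that maximizes the product of each party's gain over a disagreement (fallback) point. The sketch below applies that idea to three objectives; the plan names, scores, and disagreement point are hypothetical and are not taken from the study.

```python
from math import prod


def nash_product(scores, disagreement):
    """Product of each objective's gain over its disagreement (fallback) value."""
    gains = [s - d for s, d in zip(scores, disagreement)]
    # A plan that leaves any objective no better than the fallback is rejected.
    if any(g <= 0 for g in gains):
        return 0.0
    return prod(gains)


def select_plan(candidates, disagreement):
    """Return the name of the candidate plan with the largest Nash product."""
    return max(candidates, key=lambda name: nash_product(candidates[name], disagreement))


# Hypothetical (safety, efficiency, equity) scores on a 0-1 scale.
candidates = {
    "safety_first": (0.90, 0.55, 0.60),
    "fast_track": (0.60, 0.92, 0.58),
    "balanced": (0.78, 0.76, 0.72),
}
disagreement = (0.50, 0.50, 0.50)  # assumed fallback score per objective

print(select_plan(candidates, disagreement))  # balanced
```

Because the product collapses when any single gain is small, this rule favors plans that improve all objectives together over plans that excel on only one.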

2

Identifying Reasons for ACEI/ARB Non-Use in CKD Using Scalable Clinical NLP with Schema-Guided LLM Augmentation

Al-Garadi, M.

2026-02-12 health informatics 10.64898/2026.02.10.26346025 medRxiv

Top 0.1%

52.6%

IMPORTANCE: Although angiotensin-converting enzyme inhibitors (ACEIs) and angiotensin receptor blockers (ARBs) are recommended for people with chronic kidney disease (CKD), they remain underused. Barriers to adherence, such as adverse effects or patient refusal, are frequently embedded within unstructured clinical narratives and are therefore inaccessible to structured data analytics. Scalable natural language processing (NLP) approaches are needed to identify these barriers and support guideline-concordant care. OBJECTIVE: To develop and evaluate an NLP model capable of identifying documented reasons for ACEI/ARB non-use within clinical notes of people with CKD in the Veterans Affairs (VA) healthcare system. DESIGN, SETTING, AND PARTICIPANTS: This retrospective study analyzed electronic health record data from 2005 to 2024 including people aged 18 to 80 years with CKD, defined by an estimated glomerular filtration rate (eGFR) of 20-60 mL/min/1.73 m² and presence of albuminuria, across multiple VA medical centers. NLP models were trained on 1,025 manually annotated notes and further augmented with 4,600 synthetic examples generated through schema-guided large language model prompting. MAIN OUTCOMES AND MEASURES: The primary outcome was model performance in identifying notes containing at least one documented reason for ACEI/ARB non-use, evaluated using F1-score, precision, and recall. Secondary outcomes included model learning curve analyses and the effect of synthetic data augmentation on classification performance. RESULTS: The most common documented reasons for ACEI/ARB non-use were acute kidney injury (29.6%), increased creatinine (12.4%), cough (11.2%), and hypotension-related symptoms (11.1%). Across modeling approaches, training with synthetic data augmentation improved detection of notes containing reasons for non-use. Performance gains were statistically significant across all models (McNemar test, P < .05), with the random forest model using Nomic embeddings achieving the highest performance (F1 score, 0.79; 95% CI, 0.68-0.90). CONCLUSIONS AND RELEVANCE: We identified documented reasons for ACEI/ARB non-use (including both failures to initiate therapy and discontinuation after prior use) from unstructured text using an NLP method that does not require massive, expensive computing at inference time. By augmenting training data with schema-guided synthetic notes, we achieved robust, privacy-preserving performance within an NLP framework. This approach may support scalable clinical decision support systems to promote guideline-concordant prescribing.
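The abstract above reports a McNemar test comparing models trained with and without synthetic augmentation. As a minimal sketch of how such a paired comparison works, the function below computes the exact two-sided McNemar p-value from the discordant pairs alone. The counts in the example are hypothetical, and the abstract does not say whether the authors used the exact or the chi-square version of the test.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: notes only the first model classified correctly
    c: notes only the second model classified correctly
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 1 note only the baseline got right,
# 12 notes only the augmented model got right.
p_value = mcnemar_exact(1, 12)
print(p_value < 0.05)  # True
```

Pairs on which both models agree carry no information about which model is better, which is why only `b` and `c` enter the calculation.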

3

Evaluating a Locally Deployed 20-Billion Parameter Large Language Model for Automated Abstract Screening in Systematic Reviews

Moreira Melo, P. H.; Poenaru, D.; Guadagno, E.

2026-03-04 health informatics 10.64898/2026.03.04.26347506 medRxiv

Top 0.1%

42.0%

Background: Systematic reviews (SRs) are essential for evidence-based medicine but require extensive time and resources for abstract screening. Large language models (LLMs) offer potential for automating this process, yet concerns about data privacy, intellectual property protection, and reproducibility limit the use of cloud-based solutions in research settings. Objective: To evaluate the performance of a locally deployed 20-billion parameter LLM for automated abstract screening in systematic reviews using a sensitivity-enhanced prompting strategy, with blind expert adjudication of all discordant human-AI cases. Methods: We deployed GPT-OSS:20B locally using Ollama and evaluated its performance across three systematic reviews: AI applications in pediatric surgical pathology (n=3,350), LLM applications in electronic health records (n=4,326), and parental stress/caregiver burden in surgically treated children (n=8,970). A sensitivity-enhanced prompting strategy instructing the model to include abstracts when uncertain was employed. All discordant cases underwent blind expert adjudication. Results: Across 16,646 abstracts, the LLM demonstrated variable sensitivity after expert adjudication: 100% in SR1, 95.7% in SR2, and 85.7% in SR3. Expert adjudication identified 11 human screening errors across all reviews that the LLM had correctly classified. The LLM completed screening 4.7 times faster than human reviewers. Conclusions: A locally deployed LLM with sensitivity-enhanced prompting shows promising performance for systematic review abstract screening, particularly for technology-focused topics. Performance variability across domains suggests that screening accuracy depends partly on the objectivity of inclusion criteria. We recommend deploying LLMs as second screeners alongside human reviewers until performance is more fully validated across diverse domains.

4

Virtual Pooling Enables Accurate, End-to-End Multi-Institutional Study Execution and Causal Inference Without Centralized Data Sharing

Ahmad, I.; Ayati, A.; Liu, K.; Ko, S.; Bonine, N.; Tabano, D.; Malik, N.; Lyu, T.; Zheng, K.; Rudrapatna, V. A.; Gupta, T.

2026-03-26 health informatics 10.64898/2026.03.24.26349123 medRxiv

Top 0.1%

40.1%

Background: Multicenter retrospective studies often rely on bringing patient-level data together into a single repository, introducing substantial regulatory and operational barriers. Federated analytics provides a privacy-preserving alternative; however, existing implementations are complex to use, require extensive manual effort for data cleaning, preprocessing, and harmonization, and produce approximate rather than ground-truth results for many biostatistical methods. Virtual Pooling (VP) is a recently developed multicenter study execution platform designed to overcome these limitations. In this study, we evaluate whether VP can replicate a published multicenter retrospective study end-to-end (including data preprocessing, regression analysis, and causal inference) without centralized data aggregation. Methods: We deployed VP at the University of California, San Francisco (UCSF) and the University of California, Irvine (UCI) and attempted to replicate a published study of diabetic eye disease screening practices (UCSF N = 2,592; UCI N = 5,642). VP supported all phases of this two-center study, including data cleaning, harmonization, feature engineering, imputation, propensity score estimation, patient matching, and model estimation, all conducted through a single interface without manual coordination between centers. We verified preprocessing correctness and compared descriptive statistics and causal effect estimates with those from the original study, which relied on data transfers across the centers. We also measured the latency overhead introduced by VP. Results: VP was deployed without hospital infrastructure changes, new or non-standard governance agreements, or dedicated IT support. All preprocessing steps executed correctly, with individual preprocessing operations and descriptive statistics completing in under 1 second, logistic regression in under 10 seconds, and propensity score matching in under 30 seconds. Descriptive statistics for all 30 baseline covariates were numerically identical to the original study. Univariate regression results identifying predictors of completed screening were also identical, with recent eye clinic referral (OR = 56.7; 95% CI: 42.1-76.4) and history of eye disease (OR = 6.4; 95% CI: 5.6-7.4) as the strongest predictors. VP also reproduced pooled causal estimates of automated referrals, showing an increase in screening completion from 21% to 36% at UCSF and from 13% to 34% at UCI. Conclusion: VP enables accurate, end-to-end multicenter clinical studies without centralized data sharing. By providing a single interface that supports the full analytical workflow, from uncleaned and unharmonized data through statistical results, and by exactly reproducing pooled results, VP eliminates manual coordination and data transfers across centers. These findings validate its practical potential to transform multicenter retrospective studies, particularly in contexts where data sharing is time-consuming, bureaucratic, or restricted.

5

Longitudinal information extraction from clinical notes in rare diseases: an efficient approach with small language models

Wang, X.; Faviez, C.; Vincent, M.; Andrew, J. J.; Le Priol, E.; Saunier, S.; Knebelmann, B.; Zhang, R.; Garcelon, N.; Burgun, A.; Chen, X.

2026-03-31 health informatics 10.64898/2026.03.30.26349388 medRxiv

Top 0.1%

40.1%

Objectives: Rare diseases often require longitudinal monitoring to characterise progression, yet much clinical information remains locked in unstructured electronic health records (EHRs). Efficient recovery of such data is critical for accurate prognostic modelling and clinical trial preparation. We aimed to develop and evaluate a small language model (SLM)-based pipeline for extracting longitudinal information from French clinical notes of patients with rare kidney diseases. Methods: As a use case, we focused on serum creatinine, a key biomarker of kidney function. We analyzed 81 clinical notes comprising 200 measurements (triplet of date, value and unit). Four open-source SLMs (Mistral-7B, Llama-3.2-3B, Qwen3-4B, Qwen3-8B) were systematically tested with different prompting strategies in French and English. Outputs were post-processed to standardize formats and resolve inconsistencies, and performance was assessed across model size, prompting, language, and robustness to text duplication. Results: All SLMs extracted structured triplets, with F1-scores ranging from 0.519 to 0.928 (Qwen3-8B), outperforming the rule-based baseline. Larger models generally performed better, while prompting strategy and language had modest effects across models. SLMs also showed variable robustness to duplicated content common in real-world EHR notes. Discussion: Lightweight, locally deployable language models can accurately extract longitudinal biomarkers from unstructured clinical notes. Our findings highlight their practicality for rare diseases where data scarcity often limits task-specific model training. Conclusion: SLMs provide a privacy-preserving and resource-efficient solution for recovering longitudinal biomarker trajectories from unstructured notes, offering potential to advance real-world research and patient care in rare kidney diseases.

6

Safety and Utility of an Agentic Large Language Model-Based Hospital Course Summarizer: A Prospective Real-World Pilot Study

Grolleau, F.; Liang, A. S.; Keyes, T.; Ma, S. P.; Lew, T.; Huynh, T. R.; Steele, N.; Chung, P.; Qin, P.; Chandra, G.; Wang, S. F.; Mullen, E.; Carpenter, L.; Hoppenfeld, M.; Morrin, M.; Kyerematen, B. A.; Ambers, N.; Kotecha, N.; Alsentzer, E.; Hom, J.; Shah, N. H.; Schulman, K.; Chen, J. H.

2026-02-06 health informatics 10.64898/2026.02.05.26345607 medRxiv

Top 0.1%

39.9%

Importance: High-quality discharge summaries are essential for safe care transitions but contribute substantially to clinician documentation burden and burnout. While retrospective studies suggest large language models (LLMs) can generate clinical summaries of comparable quality to physicians, prospective data on their safety, utility, and impact on clinician well-being in real-world environments are lacking. Objective: To evaluate the safety, utilization, and impact on clinician burden of MedAgentBrief, an LLM-based agentic workflow for generating hospital course summaries, during prospective clinical deployment. Design, Setting, and Participants: Single-arm prospective pilot study encompassing 384 hospital discharges at one academic inpatient medicine unit from August 1 to October 11, 2025, with baseline comparisons drawn from April 9 to July 31, 2025. Intervention: MedAgentBrief, a custom agentic AI workflow utilizing Gemini 2.5 Pro, generated draft hospital course summaries nightly using the patient's history and physical and daily progress notes. Drafts were securely emailed to physicians daily for review and optional use. Main Outcomes and Measures: The primary outcome was physician-reported potential for and severity of harm from unedited summaries (AHRQ Common Format Harm Scale). Secondary outcomes included utilization rate, error types (omissions, inaccuracies, hallucinations), time spent on discharge summaries (EHR logs), and changes in cognitive burden (NASA Task Load Index [NASA-TLX]) and burnout (Stanford Professional Fulfillment Index [PFI] Work Exhaustion Scale). Results: The system generated 1274 summaries. Of 384 discharges, physicians utilized AI content in 219 (57%) cases. Feedback on 100 summaries (40.2%) noted omissions (25%) and inaccuracies (20%) but rare hallucinations (2%). Physicians rated 88% of unedited summaries as having no harm potential and 1% as likely to cause moderate harm; no severe harm was reported. Physician burnout scores decreased significantly (1.75 vs 1.20; P = .03). Time savings were heterogeneous: 71% of physicians saw reductions in median documentation time (up to 2.9 minutes). Conclusions and Relevance: An LLM-based agentic workflow produced hospital course summaries that were frequently utilized with mild to minimal risk of harm identified. The intervention was associated with a significant reduction in physician burnout, supporting the viability of AI summarization to mitigate documentation burden.

7

A Systematic Exploration of LLM Behavior for EHR phenotyping

Yamga, E.; Murphy, S.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350890 medRxiv

Top 0.1%

39.3%

Background: Electronic health record (EHR) phenotyping underpins observational research, cohort discovery, and clinical trial screening. Large language models (LLMs) offer new capabilities for extracting phenotypes from unstructured text, but their performance depends on pipeline design choices, including prompting, text segmentation, and aggregation. No systematic framework has previously examined how these parameters shape accuracy and reproducibility. Methods: We evaluated LLM-based phenotyping pipelines using 1,388 discharge summaries across 16 clinical phenotypes. A full factorial experiment with LLaMA-3B, 8B, and 70B systematically varied three pipeline components: prompting (zero-shot, few-shot, chain-of-thought, extract-then-phenotype), chunking (none, naive, document-based), and aggregation (any-positive, two-vote, majority), yielding 24 configurations per model. To compare intrinsic model capabilities, biomedical domain-adapted, commercial frontier (LLaMA-405B, GPT-4o, Gemini Flash 2.0), and reasoning-optimized models (DeepSeek-R1) were evaluated under a fixed configuration. Performance was assessed using precision, recall, and macro-F1; secondary analyses examined prediction consistency (Shannon entropy), self-confidence calibration, and the development of a taxonomy of recurrent model errors. Results: Factorial ANOVAs showed that chunking and aggregation were the dominant drivers of performance, whereas the prompting strategy contributed minimally. Configuration effects were stable across model sizes, with no significant Model x Parameter interactions. Phenotype difficulty varied substantially (macro-F1 = 0.40-0.90), yet the highest-performing configuration (whole-document inference without aggregation) was consistent across phenotypes, as confirmed by mixed-effects modeling. In cross-model comparisons, DeepSeek-R1 achieved the highest macro-F1 (0.89), while LLaMA-70B matched GPT-4o and LLaMA-405B at substantially lower cost. Prediction entropy was low overall and driven primarily by phenotype difficulty rather than prompting or temperature. Self-confidence calibration was only moderately informative: high-confidence predictions were more accurate, but larger models exhibited systematic overconfidence. Conclusions: LLM performance in EHR phenotyping is governed primarily by input structure and model capacity, not prompt engineering. Simple, document-level inference yields robust performance across diverse phenotypes, providing practical design guidance for LLM-based cohort identification while underscoring the continued need for human oversight for challenging phenotypes.
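The abstract above names three aggregation rules (any-positive, two-vote, majority) without defining them. One plausible reading, offered here as an assumption rather than the authors' exact definitions, is that each rule turns per-chunk binary predictions into a single document-level label by thresholding the number of positive chunks:

```python
def aggregate(chunk_labels, rule):
    """Combine per-chunk binary predictions into one document-level label."""
    positives = sum(chunk_labels)
    if rule == "any-positive":
        return positives >= 1
    if rule == "two-vote":
        return positives >= 2
    if rule == "majority":
        return positives > len(chunk_labels) / 2
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical predictions for one discharge summary split into five chunks.
chunks = [True, True, False, False, False]
for rule in ("any-positive", "two-vote", "majority"):
    print(rule, aggregate(chunks, rule))
```

Under this reading the rules trade recall for precision in order: any-positive flags the most documents and majority the fewest.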

8

Agentic Trial Emulation to Learn Health System-specific Drug Effects At Scale

Kauffman, J.; Duan, L.; Gelman, S.; Klang, E.; Sakhuja, A.; Bhatt, D. L.; Reddy, V. Y. Y.; Charney, A.; Nadkarni, G.; Qu, Y.; Huang, K.; Lampert, J.; Glicksberg, B. S.

2026-02-20 health informatics 10.64898/2026.02.19.26346539 medRxiv

Top 0.1%

38.3%

Objective: Electronic Health Record (EHR)-based trial emulation can support translation of randomized clinical trial (RCT) evidence into practice, yet emulations often diverge from published RCT results. We hypothesized that these discrepancies are structured and learnable properties of a health system's data-generating process, and that autonomous agentic workflows can generate discrepancies at the scale required for cumulative learning. Materials and Methods: We developed an agentic trial emulation framework that (1) uses an autonomous LLM agent (Biomni) to execute an end-to-end, instruction-driven emulation pipeline against an OMOP CDM database and (2) calibrates EHR estimates to RCT results with a Bayesian hierarchical model. Biomni performed protocol parsing, OMOP concept set construction, cohort building, confounder adjustment, and treatment effect estimation; it also synthesized literature-derived, comparison-specific priors for expected EHR-RCT disagreement. Five atrial fibrillation anticoagulation trials were emulated using Mount Sinai's OMOP-mapped EHR, with three independent runs per trial to quantify agent-induced analytic variability. Discrepancies between EHR-derived and published log-hazard ratios were modeled as the sum of a literature-informed reproducibility expectation, an institution-specific systematic shift, and residual heterogeneity. Performance was assessed using leave-one-out cross-validation across four in-domain DOAC-versus-warfarin trials, with one out-of-distribution evaluation (apixaban versus aspirin). Results: In pooled leave-one-out validation, calibration reduced mean absolute error from 0.567 to 0.224 log-hazard ratio (60.5% reduction) and achieved 100% empirical coverage of 95% posterior predictive intervals across held-out trials (4/4). The posterior institution-specific shift was consistently positive across folds (median 0.364-0.580), indicating systematic attenuation of DOAC benefit in the local EHR beyond literature-expected disagreement; residual heterogeneity was moderate (median 0.199-0.264). For the out-of-distribution AVERROES trial, calibrated error decreased from 0.379 to 0.051 (86.5% reduction), with the published effect within the 95% credible interval. Discussion and Conclusion: Autonomous emulation with agents enables repeated, standardized trial replications that convert EHR-RCT disagreement into data for learning institution-level transport properties. Separating comparison-specific reproducibility expectations from system-level shifts yields calibrated, uncertainty-aware local interpretations of trial evidence.
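The percentage reductions quoted in the abstract above follow directly from the before-and-after error values. A short check, using only the numbers reported there (the helper name `relative_reduction` is introduced here for illustration):

```python
def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100


# Pooled leave-one-out mean absolute error (log-hazard ratio scale).
print(round(relative_reduction(0.567, 0.224), 1))  # 60.5

# Out-of-distribution AVERROES trial error.
print(round(relative_reduction(0.379, 0.051), 1))  # 86.5
```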

9

PRE-CISE: A PRE-calibration Coverage, Identifiability, and SEnsitivity analysis workflow to streamline model calibration

Gracia, V.; Goldhaber-Fiebert, J. D.; Alarid-Escudero, F.

2026-03-02 health policy 10.64898/2026.02.27.26346591 medRxiv

Top 0.1%

34.8%

Purpose: We introduce PRE-CISE, a pre-calibration workflow that integrates coverage analysis, local sensitivity, and collinearity diagnostics to streamline model calibration and transparently address nonidentifiability. We demonstrate the benefits of PRE-CISE using a four-state Sick-Sicker Markov testbed and a COVID-19 case study. Methods: PRE-CISE begins with a coverage analysis to verify that model outputs generated with parameter sets drawn from their prior distribution span calibration targets, followed by local sensitivities to quantify the influence of parameters on model outputs, guiding the resizing of the prior distribution bounds to improve coverage. Identifiability is then assessed via collinearity analysis; large indices indicate practical nonidentifiability. For the testbed model, we calibrated 3 parameters to survival, prevalence, and the proportion of Sick to Sicker at 10, 20, and 30 years. For the COVID-19 model, we calibrated 11 parameters to match daily confirmed incident cases. Bayesian calibration was conducted for both analyses. Results: Coverage analyses flagged initial misfits; local sensitivities identified that the Sick-to-Sicker transition probability had a greater effect on model outputs, and resizing its prior distribution bounds improved coverage. Collinearity analyses showed that combining multiple calibration targets across time points enabled recovery of all three parameters. In the COVID-19 model, local sensitivity analyses prioritized time-varying detection rates and contact-reduction effects, reducing the search space and thereby improving calibration efficiency. Daily incident case calibration targets yielded collinearity indices below practical thresholds (e.g., < 15) for all parameter combinations, whereas weekly calibration targets yielded larger indices that were closer to the cutoff. Conclusions: PRE-CISE provides a practical, transparent pathway that helps modelers refine prior distribution bounds and calibration targets before intensive calibration, improving uncertainty reporting and strengthening the reliability of model-based health policy analyses.
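The abstract above does not give the formula for its collinearity index. A widely used definition (assumed here, not confirmed by the abstract) normalizes each parameter's sensitivity vector to unit length and takes one over the square root of the smallest eigenvalue of the resulting cross-product matrix. For a pair of parameters that eigenvalue is 1 - |cos θ|, where θ is the angle between the two sensitivity vectors, so the pairwise case can be sketched without a linear algebra library. The sensitivity values below are hypothetical.

```python
from math import sqrt


def pairwise_collinearity_index(s1, s2):
    """Collinearity index for two parameters' sensitivity vectors.

    Each vector holds the sensitivity of every model output to one parameter.
    """
    norm1 = sqrt(sum(x * x for x in s1))
    norm2 = sqrt(sum(x * x for x in s2))
    cosine = sum(a * b for a, b in zip(s1, s2)) / (norm1 * norm2)
    # Smallest eigenvalue of [[1, cosine], [cosine, 1]].
    smallest_eigenvalue = 1.0 - abs(cosine)
    if smallest_eigenvalue <= 0.0:
        return float("inf")
    return 1.0 / sqrt(smallest_eigenvalue)


# Hypothetical sensitivities of three calibration targets to two parameters.
distinct = pairwise_collinearity_index([1.0, 0.2, 0.1], [0.1, 0.3, 1.0])
near_duplicate = pairwise_collinearity_index([1.0, 2.0, 3.0], [1.0, 2.0, 3.01])
print(distinct < 15, near_duplicate > 15)  # True True
```

An index of 1 means the two parameters affect the outputs in orthogonal ways; large values mean a change in one parameter can be almost fully offset by a change in the other, which is the practical nonidentifiability the workflow screens for.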

10

PhenoSS: Phenotype semantic similarity-based approach for rare disease prediction and patient clustering

Chen, S.; Nguyen, Q. M.; Hu, Y.; Liu, C.; Weng, C.; Wang, K.

2026-03-02 health informatics 10.64898/2026.02.26.26347219 medRxiv

Top 0.1%

28.7%

Objective: Systematic clinical phenotyping using Human Phenotype Ontology (HPO) is central to rare disease diagnosis. However, current disease prioritization (ranking candidate diseases from HPO for a patient) methods face key challenges: they often fail to account for the hierarchical structure of HPO terms, ignore dependencies among correlated terms, and do not adjust for batch effects arising from systematic differences in phenotype documentation across cohorts, institutions, or clinicians. We aim to develop a scalable and statistically principled framework to address these limitations for rare disease prediction and patient stratification. Methods: We developed PhenoSS, a Gaussian copula-based framework that models disease-specific marginal prevalence of HPO terms while capturing their joint dependencies through a multivariate normal distribution. Phenotype frequencies were estimated using external curated resources, including OARD (Open Annotations for Rare Diseases) and HPO annotations. PhenoSS supports both pair-wise phenotype similarity calculation for patient clustering and posterior odds estimation for patient-specific disease prioritization. A batch-effect correction module mitigates systematic phenotyping differences across datasets. Results: Across diverse simulation scenarios, PhenoSS demonstrated robust disease-prediction performance and consistently improved accuracy after batch-effect correction. In real electronic health record (EHR) data, PhenoSS identified clinically meaningful patient clusters and effectively distinguished patients with different rare diseases. In disease prioritization tasks, PhenoSS achieved competitive performance with existing methods, particularly for patients exhibiting sparse or noisy phenotype annotations. Conclusion: PhenoSS provides a statistically interpretable framework for modeling phenotypic heterogeneity in rare disease research and is adaptable to other structured clinical vocabularies such as SNOMED-CT and ICD codes.

11

JARVIS, should this study be selected for full-text screening? Performance of a Joint AI-ReViewer Interactive Screening tool for systematic reviews

Barreto, G. H. C.; Burke, C.; Davies, P.; Halicka, M.; Paterson, C.; Swinton, P.; Saunders, B.; Higgins, J. P. T.

2026-04-11 health informatics 10.64898/2026.04.08.26350384 medRxiv

Top 0.1%

28.5%

Background: Systematic reviews are essential for evidence-based decision making in health sciences but require substantial time and resources for manual processes, particularly title and abstract screening. Recent advances in machine learning and large language models (LLMs) have demonstrated promise in accelerating screening with high recall but are often limited by modest gains in efficiency, mostly due to the absence of a generalisable stopping criterion. Here, we introduce and report preliminary findings on the performance of a novel semi-automated active learning system, JARVIS, that integrates LLM-based reasoning using the PICOS framework, neural network-based classification, and human decision-making to facilitate abstract screening. Methods: Datasets containing author-made inclusion and exclusion decisions from six published systematic reviews were used to pilot the semi-automated screening system. Model performance was evaluated across recall, specificity, and area under the precision-recall curve (AUC-PR), using full-text inclusion as the ground truth. Estimated workload and financial savings were calculated by comparing total screening time and reviewer costs across manual and semi-automated scenarios. Results: Across the six review datasets, recall ranged between 98.2% and 100%, and specificity ranged between 97.9% and 99.2% at the defined stopping point. Across iterations, AUC-PR values ranged between 83.8% and 100%. Compared with human-only screening, JARVIS delivered workload savings between 71.0% and 93.6%. When a single reviewer read the excluded records, workload savings ranged between 35.6% and 46.8%. Conclusion: The proposed semi-automated system substantially reduced reviewer workload while maintaining high recall, improving on previously reported approaches. Further validation in larger and more varied reviews, as well as prospective testing, is warranted.

12

Can Machine Learning Algorithms use Contextual Factors to Detect Unwarranted Clinical Variation from Electronic Health Record Encounter Data during the Treatment of Children Diagnosed with Acute Viral Pharyngitis

Mcowiti, A. O.; Neaimeh, Y. R.; Gu, J.; Lalani, Y.; Newsome, T. C.; Nguyen, Y. H.; Shrager, S.; Rasmy, L. O.; Fenton, S. H.

2026-03-02 health informatics 10.64898/2026.02.23.26346757 medRxiv

Top 0.1%

27.8%

Rationale, Aims and Objectives: Unwarranted clinical variation (UCV) in patient care often arises from contextual factors and contributes to increased costs, unnecessary treatments, and deviations from evidence-based practice. Detecting UCV is challenging due to the complexity of care decisions. Current approaches rely on centralized data aggregation and mixed-effects regression, which estimate relative variation but cannot detect absolute variation. Moreover, machine learning (ML) methods leveraging contextual factors for UCV detection are lacking. The objective is to demonstrate the feasibility of ML for identifying absolute UCV using contextual features extracted from electronic health records (EHR) and identify the factors correlated with UCV in treating acute viral pharyngitis in children. Methods: We conducted a retrospective study of pediatric ambulatory visits (ICD-10 J02.8) at an academic health system. The use case focused on unwarranted antibiotic prescriptions for acute viral pharyngitis. We trained ensemble ML models (Random Forest, CatBoost, and Explainable Boosting Machine [EBM]) using encounter-level EHR data. Performance was evaluated using nested cross-validation and AUC metrics. We also compared CatBoost models trained on curated (gold-standard) versus weak labels. Results: All three ML models demonstrated robust performance, with a median AUC of 0.91, using data from 24 clinics, 81 providers, and 122 patients within an academic health system. CatBoost models trained on weak labels exhibited performance comparable to those trained on gold-standard labels. Feature importance analysis indicated that site-level and provider-level case volumes were the most influential predictors, followed by provider credential, years of experience, and encounter type. Notably, lower provider case volumes were associated with a reduced likelihood of inappropriate treatment. Conclusions: Classical ML models can effectively detect absolute UCV using contextual EHR features. Explainable models such as EBM offer interpretability critical for clinical adoption. These findings support ML-based approaches as scalable alternatives to traditional statistical methods for UCV detection without requiring centralized data analysis.

13

Development and validation of an algorithm to identify front-line clinicians using EHR audit log data

Baratta, L. R.; Wang, J.; Osweiler, B. W.; Lew, D.; Eiden, E.; Kannampallil, T. G.; Lou, S. S.

2026-02-16 health informatics 10.64898/2026.02.13.26346268 medRxiv

Top 0.1%

27.6%

Background: Interprofessional teams are central to high quality patient care. However, identifying the clinician primarily responsible for a patient requires labor-intensive methodologies. Although electronic health record (EHR) audit logs offer a scalable alternative, their use for identifying frontline clinicians is underdeveloped. Objective: To develop and validate an algorithm utilizing EHR audit logs to identify the primary frontline clinician per patient day of an encounter and to describe care continuity patterns. Method: This was a cross-sectional cohort study of adult inpatient medicine encounters at 12 hospitals in a single health system using a shared EHR. Admissions from February 1, 2023 to April 30, 2023, with length of stay of at least 3 days and without an intensive care unit admission were included. Four algorithm iterations were designed to identify the attending physician, resident, or advanced practice provider primarily responsible for patient care on each patient-day. Performance of each algorithm was compared with manual chart review on 1,401 patient-days from 246 randomly sampled patient encounters. Accuracy between an algorithm and the chart review standard was compared using McNemar's test with Bonferroni-adjusted p-values. Results: The best performing algorithm correctly identified the primary clinician responsible for patient care on 91% of patient-days (1,268/1,401), outperforming the naive approach using frequency of actions (78% accuracy, 1,098/1,401, p<0.001). Algorithm errors were attributable to misidentified specialty and ambiguity on days with transitions of care or shared responsibilities between clinicians. The best performing algorithm was applied to the entire cohort (5,801 encounters and 34,001 patient-days) where it identified attending physicians, resident physicians, and APPs as the frontline clinician for 26,750 (79%), 3,106 (9%), and 4,145 (12%) of patient-days, respectively. Each encounter had a median of 1 (IQR 0-2) handoff between frontline clinicians. Conclusions: We developed a scalable, audit log-based algorithm to determine the frontline clinician with excellent accuracy compared with manual chart review.
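
McNemar's test, named in the methods above, compares two paired classifiers using only the patient-days on which they disagree. The following is a minimal sketch of the exact (binomial) form with a Bonferroni adjustment, not the authors' code. The discordant counts are hypothetical: the abstract reports only marginal accuracies, so the values below were chosen merely so that their difference (170) matches 1,268 - 1,098. The number of comparisons is likewise an assumption.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs where method A is correct and method B is wrong
    c: pairs where method A is wrong and method B is correct
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one (one tail).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p: float, n_comparisons: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_comparisons)


# Hypothetical discordant counts (the actual split is not reported).
p_raw = mcnemar_exact(b=200, c=30)
# Assumed: three pairwise comparisons against the naive approach.
p_adj = bonferroni(p_raw, n_comparisons=3)
print(f"raw p = {p_raw:.3g}, Bonferroni-adjusted p = {p_adj:.3g}")
```

Concordant patient-days (both methods right, or both wrong) do not enter the statistic, which is why the test suits a paired design in which every patient-day is scored by each algorithm.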

14

Diagnostic Accuracy of Large Language Models for Rare Diseases: A Systematic Review and Meta-Analysis

Nguyen, M.-H.; Yang, C.-T.; Cassini, T. A.; Ma, F.; Hamid, R.; Bastarache, L.; Peterson, J. F.; Xu, H.; Li, L.; Ma, S.; Shyr, C.

2026-03-27 genetic and genomic medicine 10.64898/2026.03.26.26349194 medRxiv

Top 0.1%

27.6%

Background: Large language models (LLMs) have been evaluated as tools to assist rare disease diagnosis, yet evidence on their accuracy remains fragmented. We conducted a systematic review and meta-analysis to synthesize the available evidence on the diagnostic performance of LLMs, identify sources of heterogeneity, and evaluate the current evidence base for clinical translation. Methods: We searched PubMed, Embase, Web of Science, Cochrane Library, arXiv, and medRxiv (January 2020-February 2026). Full-text articles and preprints were considered for inclusion. Eligible studies applied LLM-based systems to generate differential diagnoses for rare diseases and provided Recall@1 (R@1; proportion with the correct diagnosis ranked first). We pooled R@1 using Freeman-Tukey double arcsine transformation with DerSimonian-Laird random-effects models. Pre-specified subgroup analyses examined LLM knowledge augmentation strategy and input modality. Because both retained high residual heterogeneity, we conducted a post-hoc exploratory analysis of evaluation benchmark disease composition, mapping diseases from major benchmarks to Orphanet prevalence classifications. Risk of bias was assessed using a modified QUADAS-3 instrument. Findings: We identified 902 records, of which 564 were screened and 15 studies were eligible. These 15 studies contributed 19 system-dataset entries to the meta-analysis (total N=39,529 cases). The pooled R@1 was 43.3% (95% CI 35.1-51.6; I²=99.6%). Augmented LLM systems (agent-based reasoning, retrieval, or fine-tuning; k=8) achieved R@1 of 52.5% (42.0-62.9) versus 35.4% (30.6-40.4) for standalone LLMs (k=11; p=0.004). 
Post-hoc exploratory analysis indicated that evaluation benchmark disease composition was associated with differences in diagnostic performance: R@1 was 21.7% (18.2-25.5) on the Phenopacket Store dataset (k=2), where 52.8% of diseases were ultra-rare, versus 52.0% (40.7-63.2) on RareBench (k=6), where 29.3% were ultra-rare (p<0.001). All 19 system-dataset entries were assessed to be at high risk of bias, most commonly due to potential data leakage and limited reproducibility. No study provided prospective clinical validation. Interpretation: Diagnostic performance of LLM-based systems for rare diseases varied substantially across evaluation benchmarks. Post-hoc exploratory analysis indicated that performance was associated with benchmark disease composition. Performance was higher in benchmarks containing fewer ultra-rare diseases and in systems incorporating external knowledge at inference time. However, all included studies were at high risk of bias, and none reported prospective clinical validation. These findings highlight the need for prevalence-stratified evaluation benchmarks and independent prospective studies before clinical deployment. Funding: This work was supported in part by the National Institutes of Health Common Fund, grant 15-HG-0130 from the National Human Genome Research Institute, U01NS134349 from the National Institute of Neurological Disorders and Stroke, R00LM014429 from the National Library of Medicine, and the Potocsnak Center for Undiagnosed and Rare Disorders.
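
The pooling approach named in the methods (Freeman-Tukey double arcsine transformation followed by a DerSimonian-Laird random-effects model) can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' analysis: the study counts are hypothetical, and the back-transformation uses the simple approximation sin²(t/2) rather than the harmonic-mean-based inversion that dedicated meta-analysis packages typically apply.

```python
import math


def freeman_tukey(x: int, n: int) -> tuple[float, float]:
    """Double arcsine transform of x successes out of n, plus its variance."""
    t = math.asin(math.sqrt(x / (n + 1))) + math.asin(math.sqrt((x + 1) / (n + 1)))
    return t, 1.0 / (n + 0.5)


def pool_dersimonian_laird(studies):
    """studies: list of (correct_at_rank_1, n_cases).

    Returns (pooled proportion, tau^2, I^2 as a fraction).
    """
    ts, vs = zip(*(freeman_tukey(x, n) for x, n in studies))
    w = [1.0 / v for v in vs]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * ti for wi, ti in zip(w, ts)) / sum(w)
    q = sum(wi * (ti - fixed) ** 2 for wi, ti in zip(w, ts))  # Cochran's Q
    df = len(studies) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_re = [1.0 / (v + tau2) for v in vs]  # random-effects weights
    pooled_t = sum(wi * ti for wi, ti in zip(w_re, ts)) / sum(w_re)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    # Simplified back-transformation (assumption; see lead-in).
    return math.sin(pooled_t / 2) ** 2, tau2, i2


# Hypothetical (correct, total) counts for three system-dataset entries.
example = [(120, 400), (300, 500), (45, 250)]
pooled, tau2, i2 = pool_dersimonian_laird(example)
print(f"pooled R@1 = {pooled:.3f}, tau^2 = {tau2:.4f}, I^2 = {100 * i2:.1f}%")
```

When studies disagree strongly, tau² grows and the random-effects weights become nearly equal across studies, so small benchmarks influence the pooled estimate almost as much as large ones.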

15

Governing Decisions of Probability Cutoffs in Clinical AI Deployment: A Case Study of Asthma Exacerbation Prediction

Zheng, L.; Agnikula Kshatriya, B. S.; Ohde, J.; Rost, L.; Malik, M.; Peterson, K.; Brereton, T.; Loufek, B.; Pereira, T.; Gai, C.; Park, M.; Hartz, M.; Fladager-Muth, J.; Wi, C.-I.; Tao, C. J.; Garovic, V.; Juhn, Y. J.; Overgaard, S. M.

2026-03-22 health informatics 10.64898/2026.03.18.26348562 medRxiv

Top 0.1%

26.1%

Models that estimate the probability of an adverse clinical outcome require an operational cutoff to translate continuous estimated probabilities into discrete labels that can trigger clinical action. Although statistical methods identify optimal cutoffs, threshold selection ultimately reflects value judgments regarding harm tolerance, resource allocation, and workflow feasibility. We describe a governance-informed approach to selecting a deployment threshold for an asthma exacerbation (AE) prediction model integrated into clinical workflows. Using prevalence-adjusted performance metrics and real-world provider capacity modeling, we evaluated multiple candidate thresholds and quantified downstream workload and missed-event trade-offs. We demonstrate that statistically optimal thresholds may produce operationally infeasible alert volumes or unacceptable miss rates. We propose a structured threshold governance framework integrating statistical performance, clinical utility, stakeholder input, and human oversight safeguards. This case illustrates how threshold decisions should be treated as organizational governance processes rather than purely technical optimizations.
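
The trade-off described above, between alert volume and missed events at each candidate cutoff, can be made concrete with a short calculation. The sketch below is only an illustration of prevalence-adjusted metrics combined with a capacity check. Every number in it (sensitivity and specificity per cutoff, prevalence, panel size, weekly review capacity) is hypothetical and not taken from the study.

```python
def threshold_tradeoffs(sensitivity, specificity, prevalence, patients_per_week):
    """Expected alert volume, PPV, and missed events at one candidate cutoff."""
    true_pos = sensitivity * prevalence                # fraction flagged with event
    false_pos = (1 - specificity) * (1 - prevalence)   # fraction flagged without event
    alert_rate = true_pos + false_pos
    return {
        "ppv": true_pos / alert_rate if alert_rate else 0.0,
        "alerts_per_week": alert_rate * patients_per_week,
        "missed_per_week": (1 - sensitivity) * prevalence * patients_per_week,
    }


# Hypothetical operating points: cutoff -> (sensitivity, specificity).
candidates = {0.2: (0.85, 0.70), 0.4: (0.65, 0.88), 0.6: (0.40, 0.96)}
PREVALENCE = 0.05        # assumed event rate in the deployment population
PANEL_SIZE = 1000        # assumed patients scored per week
WEEKLY_CAPACITY = 150    # assumed number of alerts staff can review per week

for cutoff, (sens, spec) in candidates.items():
    row = threshold_tradeoffs(sens, spec, PREVALENCE, PANEL_SIZE)
    feasible = row["alerts_per_week"] <= WEEKLY_CAPACITY
    print(cutoff, {k: round(v, 2) for k, v in row.items()}, "feasible:", feasible)
```

Tabulating each cutoff this way puts the workload and the miss rate side by side, which is the information a governance group would need before accepting or rejecting a statistically optimal threshold.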

16

Trustworthy personalized treatment selection: causal effect-trees and calibration in perioperative medicine

Mittelberg, Y.; Stiglitz, D. K.; Kowadlo, G.

2026-03-04 health informatics 10.64898/2026.03.03.26347440 medRxiv

Top 0.1%

23.5%

Background: Personalized medicine promises to tailor treatments to the individual, but it carries a hidden risk: mistaking statistical noise for actionable clinical insight. Current machine learning approaches often provide predictions, but fail to inform clinicians when those predictions are unreliable. Objective: Develop a deployment-readiness framework that integrates causal inference, interpretable effect-trees, and calibration assessment to distinguish actionable signal from unreliable variation, and to support treatment selection only when the estimated benefit is both reliable and clinically meaningful. Methods: Using retrospective observational cohort EHR data from the INSPIRE perioperative dataset (N>130,000 surgical operations, 2011-2020), we estimated treatment effects using causal forests with double machine learning, benchmarked against other causal methods to assess convergence. We used the estimated causal effects to create effect-trees and translated estimates into interpretable rules. We validated the treatment recommendations by assessing subgroup calibration to identify which groups were reliable for treatment selection. Results: In a prostate procedures case study (neuraxial versus general anesthesia; total N=2,822), neuraxial anesthesia was associated with substantially lower post-operative opioid use (ATE = -1.38 opioid medications, 95% CI [-1.62, -1.15]). The effect-tree produced five clinically interpretable subgroups using BMI, ASA status, and age, with effects ranging from -1.10 to -1.59 opioid medications. Calibration analysis identified four of five subgroups as reliable for deployment (calibration error < 0.08), while one small subgroup (N=250) showed higher calibration error (0.44), illustrating how the framework rates unreliable heterogeneity. Conclusions: Individual prediction heterogeneity does not automatically justify clinical personalization. By combining effect-trees with calibration, this framework distinguishes actionable heterogeneity from noisy heterogeneity (detectable but unreliable). This approach transforms causal machine learning from a black box into a validated decision support system that enables selective deployment of treatment decision rules.

17

Decision Curve Analysis for Evaluating Machine Learning Models for Next-Day Transfer Out of ICU

Pozo, M.; Pape, A.; Locke, B.; Pettine, W. W.

2026-04-21 health informatics 10.64898/2026.04.19.26351213 medRxiv

Top 0.1%

23.4%

Timely identification of intensive care unit (ICU) patients likely to exit the unit can support anticipatory workflows such as chart review, eligibility screening, and patient outreach prior to transfer. Most ICU discharge prediction studies report discrimination and calibration, but these metrics do not quantify the decision consequences of acting on predictions. Using adult ICU admissions from MIMIC-IV, we represented each ICU stay as a sequence of daily clinical summaries and trained logistic regression, random forest, and XGBoost models to predict next-day ICU transfer. Models achieved ROC AUC of 0.80-0.84 with differing calibration. We evaluated decision utility using decision curve analysis (DCA), where positive predictions trigger proactive review. Across thresholds, model-guided strategies outperformed review-all, review-none, and a simple clinical rule. To translate net benefit into implementable operations, we modeled a clinical trial recruitment workflow with an 8-hour daily time constraint, incorporating chart review and consent effort. At a feasible operating threshold (0.23), the model flagged ~23 charts/day and yielded ~1.23 enrollments/day under conservative eligibility and consent assumptions. These results demonstrate that DCA provides a transparent framework for determining when ICU transfer predictions are worth using and how thresholds should be selected to align with real-world workflow constraints. Data and Code Availability: This research has been conducted using data from MIMIC-IV. Researchers can request access via PhysioNet. Implementation code is available upon request.
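
Decision curve analysis rests on the standard net benefit formula, NB = TP/N - (FP/N) × pt/(1 - pt), where pt is the threshold probability. The sketch below shows how a model-guided strategy is compared with the review-all and review-none defaults at a given threshold. It is a minimal illustration on toy labels and probabilities, which are hypothetical, and it does not reproduce the authors' workflow model.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of reviewing every stay with predicted probability >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are penalized by the odds implied by the threshold.
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_review_all(y_true, threshold):
    """Net benefit of reviewing everyone; review-none is 0 by definition."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data: 1 = transferred out of the ICU the next day (hypothetical values).
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.1]

for pt in (0.2, 0.5):
    model_nb = net_benefit(y_true, y_prob, pt)
    all_nb = net_benefit_review_all(y_true, pt)
    print(f"pt={pt}: model={model_nb:.4f}, review-all={all_nb:.4f}, review-none=0")
```

A model is worth using at a given threshold only if its net benefit exceeds both defaults there; plotting these values across a range of thresholds gives the decision curve.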

18

Monte Carlo Committee Simulation with Large Language Models for Predicting Drug Reimbursement Recommendations and Conditions: A Novel Neurosymbolic AI Approach

Janoudi, G.; Rada (Uzun), M.; Yasinov, E.; Richter, T.

2026-03-03 health policy 10.64898/2026.03.02.26347434 medRxiv

Top 0.1%

23.2%

Background: Health technology assessment (HTA) agencies issue reimbursement recommendations that determine patient access to new therapies. Predicting these outcomes would enable sponsors to optimize market access strategies and health systems to anticipate budget impacts. However, traditional machine learning approaches require extensive manual feature extraction and predict only categorical outcomes, not the specific conditions attached to recommendations. Methods: We developed Monte Carlo Committee Simulation, a neurosymbolic system that simulates multi-panelist deliberation using 14 persona-conditioned large language model panelists with weighted voting and uncertainty quantification. We conducted a temporal external validation study on CDA-AMC (Canada's Drug Agency) sponsor-submitted recommendations published between October 2024 and December 2025 (n=67), after the knowledge cutoff of the underlying models, ensuring predictions reflected reasoning rather than memorization. The system predicted both recommendation category (Reimburse with Conditions, Do Not Reimburse) and five condition categories (Population Restrictions, Prescriber/Setting Requirements, Continuation Conditions, Economic Conditions, Evidence Conditions). Results: On submissions where the system expressed confidence (n=44), recommendation prediction achieved 93.2% accuracy (95% CI: 84.1-100.0%), exceeding the 91.8% (95% CI: 83.7-98.0%) majority class baseline. The system demonstrated superior discrimination versus chance level (AUROC 0.817, 95% CI: 0.45-0.99, vs 0.500) and calibrated confidence estimates (ECE = 0.091). Pre-specified Strength of Mandate stratified accuracy from 96.8% (High, 95% CI: 90.3-100.0%) to 40.0% (Weak, 95% CI: 0.0-80.0%), with 83.3% of errors occurring in cases flagged as uncertain (p=0.0025). Analysis of the 5 abstained cases confirmed 40.0% accuracy, validating the system's identification of uncertain predictions. For condition prediction, the system achieved 48.8% subset accuracy, requiring correct simultaneous prediction of all 5 condition categories (2^5 = 32 possible combinations), and 86.3% Hamming accuracy versus 25.8% for a no-conditions baseline. Per-category accuracy ranged from 68.3% (Continuation Conditions) to 97.6% (Economic Conditions), with Continuation Conditions demonstrating the strongest discriminative ability (AUROC 0.896, 95% CI: 0.79-0.98). Conclusions: Monte Carlo Committee Simulation enables a shift from reactive to proactive market access: anticipating specific reimbursement conditions before committee review, with calibrated confidence that identifies which predictions to trust. Validated on temporally separated data the models could not have memorized, the system can be positioned as a forecasting aid that complements rather than replaces human deliberation.
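
The two condition-prediction metrics reported above differ in strictness: subset accuracy credits a submission only when all five condition categories are predicted correctly, whereas Hamming accuracy credits each category separately. The following sketch illustrates the difference on two made-up submissions. The label vectors are hypothetical, and the category order is simply the order listed in the abstract.

```python
def subset_accuracy(y_true, y_pred):
    """Share of cases whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_accuracy(y_true, y_pred):
    """Share of individual (case, category) labels predicted correctly."""
    total = sum(len(t) for t in y_true)
    matches = sum(a == b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return matches / total


# Order: Population, Prescriber/Setting, Continuation, Economic, Evidence.
y_true = [(1, 0, 1, 0, 0), (1, 1, 0, 0, 1)]   # hypothetical ground truth
y_pred = [(1, 0, 1, 0, 0), (1, 0, 0, 0, 1)]   # hypothetical predictions
no_conditions = [(0, 0, 0, 0, 0)] * len(y_true)  # baseline: predict no conditions

print("subset accuracy:", subset_accuracy(y_true, y_pred))
print("Hamming accuracy:", hamming_accuracy(y_true, y_pred))
print("Hamming accuracy, no-conditions baseline:", hamming_accuracy(y_true, no_conditions))
```

In this toy case a single wrong category in the second submission halves subset accuracy while Hamming accuracy only drops to 0.9, which is why the two figures in the abstract can sit so far apart.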

19

Leveraging Expert Knowledge and Causal Structure Learning to Build Parsimonious Models of Acute Brain Dysfunction in the Pediatric Intensive Care Unit

Perez Claudio, E.; Horvat, C.; Au, A. K.; Clark, R. S. B.; Taylor, M. W.; Cooper, G. F.; Li, R.; Nourelahi, M.; Hochheiser, H.

2026-02-18 health informatics 10.64898/2026.02.17.26345661 medRxiv

Top 0.1%

23.1%

Machine learning adoption in clinical decision support systems remains limited by concerns about transparency and robustness. Causal structure learning (CSL) combined with expert knowledge may address these concerns by identifying potentially causal predictors, enabling more interpretable and clinically aligned models. In this study, we show that by integrating clinician expertise with CSL algorithms we can identify plausible causal drivers of acquired acute brain dysfunction (ABD) in the pediatric intensive care unit (PICU), which enables the development of parsimonious predictive models without substantial loss in performance. To do so, we analyzed 18,568 PICU encounters from the University of Pittsburgh Medical Center Children's Hospital (2010-2022) and elicited knowledge from experienced clinicians. Encounters with acquired ABD were defined using the validated ABD computable phenotype. Expert knowledge was elicited from four clinicians through iterative interviews to construct a consensus directed acyclic graph (DAG). Clinician consensus achieved acceptable inter-rater reliability (Fleiss' kappa = 0.62) after two rounds of interviews and identified 16 biomarkers as potential causes of acquired ABD. Two CSL algorithms, GOLEM and PC-MB, were applied to enrich the clinicians' consensus DAG. The PC-MB algorithm showed 78% concordance with expert consensus, while GOLEM showed 46%. Together, the CSL algorithms identified seven biomarkers as potential causes that were not included in the clinicians' DAG: blood urea nitrogen, creatinine, dobutamine, glucose, potassium, PTT, and SpO2. Using multiple variations of the enriched DAGs, XGBoost models were trained using biomarkers identified as potential causes of acquired ABD; these were evaluated primarily by area under the precision-recall curve (AUPRC). Models trained on the intersection of clinician consensus and PC-MB DAGs achieved an AUPRC of 0.79 (95% CI: 0.75-0.82) using only 14 biomarkers, compared to 0.81 (95% CI: 0.78-0.84) for the control model using all 45 biomarkers. When restricted to vitals and laboratory results alone, the best-performing model achieved an AUPRC of 0.77. Combining clinical expertise with causal structure learning enables the identification of causal hypotheses consistent with the clinical understanding of the participating clinicians and the development of parsimonious predictive models for acquired ABD in the PICU.

20

From Study Design to Executable Code: Automating Target Trial Emulation with Large Language Models

Kim, H.; Kim, M.; Kim, S.; You, S. C.

2026-03-14 health informatics 10.64898/2026.03.13.26348306 medRxiv

Top 0.1%

23.1%

Introduction: Implementing target trial emulation (TTE) study methods as end-to-end executable analytic code is technically demanding, and producing standardized, reproducible scripts consistently across research teams remains a persistent challenge. We aimed to develop a framework that translates free-text study descriptions into standardized analytic specifications and executable Strategus R scripts for the Observational Health Data Sciences and Informatics (OHDSI) ecosystem. Methods: We developed THESEUS (Text-guided Health-study Estimation and Specification Engine Using Strategus), which operates through two sequential steps. Large language models (LLMs) first map descriptions of the study into a constrained JavaScript Object Notation (JSON) schema (standardization step), after which the structured specifications are converted into R scripts with a self-auditing loop for error correction (code generation step). We evaluated eight proprietary LLMs using texts extracted from the methods section of 15 OHDSI-based TTE studies, and externally validated the framework on texts from 5 non-OHDSI studies, across three input settings: primary analysis text only, full analyses text, and full methods sections. Standardization was evaluated at the study-level (whether all parameters in a study were correctly extracted) and at the field-level (sensitivity and false positive rate per individual parameter) with field-level evaluation applied to the full analyses text and full methods sections input settings. Code generation was assessed by executability of the produced R scripts before and after self-auditing. Results: In the standardization step, study-level accuracy across models ranged from 0.91 to 0.98 for primary analysis, 0.67 to 0.87 for full analyses, and 0.67 to 0.85 for full methods sections in OHDSI studies, whereas the corresponding ranges were 0.73 to 0.93, 0.60 to 0.87, and 0.27 to 0.47 in non-OHDSI studies. At the field-level, sensitivity across models under the full analyses text input setting ranged from 0.73 to 0.90 with 0.27 to 0.67 false positives per study in OHDSI studies, and from 0.71 to 0.90 with 0.20 to 1.00 false positives per study in non-OHDSI studies, depending on input setting. For code generation, first-run executability ranged from 0.80 to 1.00 for OHDSI studies and improved to 0.93 to 1.00 after self-auditing. In non-OHDSI studies, first-run executability ranged from 0.60 to 1.00, improving to 1.00 after self-auditing. Discussion: THESEUS demonstrates that pairing a standardized data model with a structured analysis framework enables reliable LLM-powered automation of the coding step in observational research. THESEUS supports the reliable translation of natural-language study descriptions into executable, shareable code in standardized observational research settings. This approach has the potential to lower the technical barriers to participation in observational research for a broader range of investigators.